Concept dictionary based information retrieval

ABSTRACT

A method and apparatus are provided for generating and updating a concept dictionary 140 in respect of an information system 125 and for using that concept dictionary to assist in selecting queries and query terms for use in interrogating that information system 125. A lexical reference source 115 is first used to generate queries semantically related to a query 110 entered by a user, and the answers returned for each query are analysed using a fuzzy processing technique (135) to determine semantic relationships between the queries. The queries and the determined relationships are recorded in a concept dictionary 140 for subsequent use.

This invention relates to information retrieval and in particular to a method and apparatus for generating a concept dictionary in respect of an information system for use in retrieving information from that system.

It is often assumed in prior art electronic information access systems that a user understands something of the structure of the stored data and the methods used to access those data to be able to access relevant information efficiently. In particular, the user may be expected to know terms that appear in stored entries of potential interest and be able to choose query terms that distinguish these entries from others stored in the system. To help avoid this dependence on user knowledge, it is known to use a thesaurus or ontology to convert queries expressed in the user's favoured terms into queries that may enable the system to retrieve the most relevant entries. For example, if no entries are found in response to a query including the word pizza, an ontology might suggest use of the term Italian restaurant instead. However, to be generally applicable, such an ontology must, of necessity, be extremely broad. Experience of the Artificial Intelligence (AI) community suggests that this approach is impractical and that it may be impossible to implement a “universal” AI-based ontology containing human-level general knowledge. In addition, because a universal ontology has to be extremely broad, it tends to over-generalise queries. For example, the word car might be replaced by reasonable synonyms such as auto, automobile, or motorcar but it might also be replaced by machine, railway car, elevator car or gondola, which are not relevant to the query.

According to a first aspect of the present invention there is provided a method of generating a concept dictionary for use in querying an information system, comprising the steps of:

(i) receiving an information search criterion;

(ii) deriving, using a lexical reference source, at least one search criterion having related meaning to said received search criterion;

(iii) identifying sets of information in said information system relevant to said received search criterion and to said at least one derived search criterion;

(iv) analysing the identified sets of information to derive relationships between said received search criterion and said at least one derived search criterion in the context of said information system; and

(v) storing, in a concept dictionary, information relating to said received and said at least one derived search criterion and to respective said derived relationships therebetween, for use in querying said information system.

The method according to this first aspect of the present invention is particularly applicable to a small subsystem such as an intranet or database, being arranged to deduce the important concepts and their relationships in that limited domain. A local, system-specific concept dictionary or ontology can be used to help a user to generalise, specialise or select equivalent queries and query terms for use in subsequent information retrieval activities without the user becoming lost in over-generalisation.

Recognising that universal ontologies are too general to be of use for query expansion in a relatively limited domain, preferred embodiments of the present invention attempt to extract only that subset of ontological information relevant to the query mechanism and the stored data in a specific information system and to store that ontological information in a concept dictionary specifically relevant to that information system. The concept dictionary is derived with respect to the complete information system, and is not simply a property of the stored data. Interactions between the actual data stored and the mechanism used to access the data have been found to be important to understanding the relationships between queries; relationships that cannot be accurately derived from the stored data alone.

Preferably the concept dictionary is “fuzzy” in that it allows a concept to be approximately equivalent to another concept, or to have partial membership in a parent concept. Fuzzy modelling and processing techniques are described for example in “Fuzzy Sets” by L. Zadeh, Journal of Information and Control, Volume 8, 1965, pp 338-353, and “Fuzzy Logic Controllers”, Parts 1 and 2, by C. Lee, IEEE Transactions on Systems, Man and Cybernetics, Volume 20, 1990, pp 404-435. The application of fuzzy modelling techniques to relate concepts in preferred embodiments of the present invention has been found to be particularly advantageous. Consider, for example, a classified telephone directory. Those directory entries retrieved in response to a query term “garage” might include almost all directory entries that offer “car repair”. From this it may be deduced that “car repair” is almost always a more specific concept than “garage”. However, relationships derived in this way cannot be guaranteed to be true in all cases. While a conditional probability might be used to relate entries if entries were retrieved with complete certainty, in almost all query-answering systems dealing with semi- or un-structured data different entries satisfy the query to a greater or lesser degree. Since this degree of satisfaction cannot be treated as a pure probability, it is not possible to apply standard probability theory to the relation between two concepts. However, by treating rankings of entries as fuzzy memberships, uncertain relationships between queries can be modelled, for example relationships such as “car repair is almost always a more specific query term than garage”.

According to a second aspect of the present invention there is provided a method of accessing sets of information stored in an information system using information search criteria stored in a concept dictionary generated for the information system according to the method defined according to the first aspect of the present invention above, comprising the steps of:

(a) selecting a first information search criterion;

(b) using a search engine to identify one or more sets of information in the information system relevant to said first information search criterion; and

(c) selecting at least one further information search criterion from search criteria stored in the concept dictionary, semantically related to said first information search criterion according to information stored in the concept dictionary, according to whether a more general, a more specialised or an equivalent search is required.

According to a third aspect of the present invention there is provided an information retrieval apparatus for accessing sets of information stored in an information system, comprising:

an input for receiving an information search criterion;

deriving means for deriving, using a lexical reference source, at least one search criterion having related meaning to said received information search criterion;

retrieval means for identifying sets of information in said information system relevant to said received search criterion and to said at least one derived search criterion;

analysis means for analysing said identified sets of information to derive relationships between said received search criterion and said at least one derived search criterion in the context of said information system; and

updating means for storing, in a concept dictionary, information relating to said received and said at least one derived search criterion and to respective said derived relationships therebetween, for use in querying said information system.

Preferred embodiments of the present invention will now be described in more detail, by way of example only, with reference to the accompanying drawings of which:

FIG. 1 is a diagram showing features of an information retrieval apparatus according to a preferred embodiment of the present invention;

FIG. 2 is a flow diagram showing preferred steps in operation of the apparatus of FIG. 1; and

FIG. 3 is a diagram representing in graphical form an example of knowledge stored in a concept dictionary generated according to preferred embodiments of the present invention.

An apparatus, according to preferred embodiments of the present invention, for use in retrieving information data sets from an information system, will firstly be described with reference to FIG. 1.

Referring to FIG. 1, a preferred information retrieval apparatus 100 comprises a query editor and generator 105 arranged to receive an input query 110 entered by a user or otherwise retrieved from a store of queries. The query editor and generator 105 is arranged with access to an external lexical reference source 115 to enable one or more queries having a related meaning to the input query 110 to be derived, for example by substituting a noun occurring in the input query 110 with a semantically related noun or phrase obtained from the external lexical reference source 115. A lexical database suitable for this purpose is Wordnet™, accessible over the Internet at http://www.cogsci.princeton.edu/~wn/.

A query execution and information retrieval module 120 is arranged to receive the input query 110 and each of the derived queries generated by the query editor and generator 105, and to identify information data sets, stored in an information system 125, relevant to each of the received queries. The module 120 may be a conventional search engine arranged to implement a known information searching algorithm, preferably one arranged to calculate, for each identified set of information, a weighting factor indicative of the degree of relevance of that set of information to the respective executed query.

Those sets of information identified by the information retrieval module 120 as being relevant to the input query 110 in particular are output as answers 130. In addition, the results of the information retrieval (120) in respect of the input query 110 and each of the queries derived by the query editor and generator 105 are received by a knowledge acquisition module 135, together with the input (110) and derived (105) queries themselves, for analysis. The knowledge acquisition module 135 is arranged to execute an algorithm for deriving semantic relationships between the input (110) and derived (105) queries on the basis of the results of information retrieval by the module 120 from the information system 125. In particular, the knowledge acquisition module 135 is arranged to determine whether one of the queries, or terms comprised in the query, represents a specialisation or a generalisation of another of the queries on the basis of the relative scope of information retrieved by the module 120. In this way, any semantic relationships suggested with reference to the external lexical reference 115 when generating the derived queries (105) are tested in the specific context of the information system 125 and a measure of the extent to which the suggested relationships apply in that context is determined by the knowledge acquisition module 135. A store is provided to store a concept dictionary 140 in respect of the information system 125, the concept dictionary 140 comprising a record of each of the queries, input (110) and derived (105), and the respective measures determined by the knowledge acquisition module 135 of semantic relationships therebetween, or between terms comprised in those queries. As new input queries 110 are received, the knowledge acquisition module 135 is able to update the concept dictionary 140 by adding new queries and new relationships and by updating values associated with previously stored relationships, thereby capturing new “knowledge” about the concepts embodied in the information system 125 and in the user's choice of queries (110).

Once the concept dictionary 140 has been established through a period of use of the apparatus 100, it may be used by the query editor and generator 105 to enable a user to select further queries to use in interrogating the information system 125 according to whether the user wishes to expand the scope of information retrieval, to reduce its scope or merely to search the information system 125 using semantically equivalent queries. Each time the user uses the apparatus 100 to retrieve information, particularly when the user enters a new query 110 not previously used, the knowledge acquisition module 135 is able to update and improve the store of “knowledge” in the concept dictionary 140 for the ongoing benefit of users of the information system 125.

In a preferred embodiment of the present invention, to be described below, the knowledge acquisition module 135 and the concept dictionary 140 are arranged, respectively, to process and to store fuzzy relationships between queries and hence to provide a less precise (less “crisp”) and thus more appropriate measure of semantic equivalence for storage in the concept dictionary 140. This has the advantage that lines of enquiry may be suggested to and selected by users of the apparatus 100 that would not ordinarily have been apparent with more precise “crisp” processing, with the potential to yield more useful results from the information system 125. The decision to use fuzzy processing techniques in preferred embodiments of the present invention recognises the fact that information retrieval on the basis of user-supplied queries is a relatively imprecise process. Fuzzy processing has the potential to extract more useful information from the implicit and explicit assumptions behind a user's choice of input query and the body of information in the information system 125 than is possible with crisp processing of semantic relationships.

However, before discussing the preferred use of fuzzy processing by the apparatus 100, an example will be described to show how the concept dictionary 140 may be populated with “knowledge” acquired using “crisp” processing techniques.

Consider two queries Q1 and Q2, with their corresponding answer sets S1 and S2 obtained by interrogating the information system 125. Assume these answer sets to be completely certain, rather than weighted to some degree of relevance. Assume that

    Q1 = “find a garage in Ipswich”

and

    Q2 = “find car repair in Ipswich”

and that the information system 125 returns a set of answers to the second query, S2, which is a subset of S1. It may be deduced from this that “car repair” is a term having a more restricted meaning than the term “garage”. A human expert is able to recognise cases of generalisation and specialisation in queries, but known techniques can also be used to achieve this automatically, for example with reference to a lexical database such as Wordnet™, accessible over the Internet at http://www.cogsci.princeton.edu/~wn/, and able to supply, for example, hotel as a synonym for the noun inn. If this is a valid equivalence in the context of the information system 125, it may be expected that the information system 125 would return identical sets of answers in response to a query searching for hotels in a particular location and to a query searching for inns in the same location.

Formally, let Q(x) denote a query predicate that returns true or false according to whether or not an entry x is relevant to the query Q. Then the set of solutions

    SQ = {x | Q(x)}

is the set of all entries x that satisfy (are relevant to) the query Q. It can be stated that for two queries, Q and P:

- Q generalises P if SP ⊂ SQ
- Q specialises P if SQ ⊂ SP
- Q is equivalent to P if SQ = SP

Consider the following set of queries and corresponding answers:

| id | queryid | Query | ansid | Answer Entry |
|----|---------|-------|-------|--------------|
| 1 | q1 | car hire in Ipswich | a1 | Eurodollar rent a car |
| 2 | q1 | car hire in Ipswich | a2 | Autorent (UK) |
| 3 | q2 | car rental in Ipswich | a1 | Eurodollar rent a car |
| 4 | q2 | car rental in Ipswich | a2 | Autorent (UK) |
| 5 | q3 | restaurant in Suffolk | a3 | Church Yards Seafood Restaurant |
| 6 | q3 | restaurant in Suffolk | a4 | Curry Inn |
| 7 | q3 | restaurant in Suffolk | a5 | Passage To India |
| 8 | q3 | restaurant in Suffolk | a6 | Chicago Rock Cafe |
| 9 | q4 | restaurant in Ipswich | a5 | Passage To India |
| 10 | q4 | restaurant in Ipswich | a6 | Chicago Rock Cafe |
| 11 | q5 | Indian restaurant in Suffolk | a4 | Curry Inn |
| 12 | q5 | Indian restaurant in Suffolk | a5 | Passage To India |
| 13 | q6 | Indian restaurant in Ipswich | a5 | Passage To India |

By the above reasoning, the answers to queries q3, q4, q5 and q6 in the table above can be used to deduce that Ipswich is a more specific term than Suffolk and that Indian Restaurant is a more specific term than Restaurant. Such deduced information may be stored in a concept dictionary 140 and used subsequently to assist users in generalising or specialising their queries.
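The crisp definitions can be checked mechanically against the table. The following is a minimal Python sketch, not part of the described apparatus: the answer sets are copied from the table above, while the function name `crisp_relation` and the label "unrelated" (for pairs where neither set contains the other) are assumptions introduced for illustration.

```python
# Answer sets copied from the table above (query id -> set of answer ids).
ANSWERS = {
    "q1": {"a1", "a2"},
    "q2": {"a1", "a2"},
    "q3": {"a3", "a4", "a5", "a6"},
    "q4": {"a5", "a6"},
    "q5": {"a4", "a5"},
    "q6": {"a5"},
}


def crisp_relation(q, p, answers=ANSWERS):
    """Describe how query q relates to query p under the crisp definitions."""
    sq, sp = answers[q], answers[p]
    if sq == sp:
        return "equivalent"
    if sp < sq:  # SP is a proper subset of SQ
        return "generalises"
    if sq < sp:  # SQ is a proper subset of SP
        return "specialises"
    return "unrelated"  # hypothetical label: neither set contains the other


if __name__ == "__main__":
    for q, p in [("q1", "q2"), ("q3", "q4"), ("q6", "q5"), ("q4", "q5")]:
        print(q, crisp_relation(q, p), p)
```

Here q3 generalises q4 (Suffolk versus Ipswich) and q6 specialises q5, which matches the deductions drawn in the text.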

The relationships between queries or query terms as derived in the example above are examples of “crisp” relationships. They are derived on the basis that the answers to the submitted queries are certain. In practice this is not generally the case. The preferred approach for use in embodiments of the present invention is to extend the ideas above to allow partial relevance of answer entries to queries and to convert the crisp relationships into fuzzy relationships. In this preferred approach the definitions of generalisation, specialisation and equivalence are expanded to cater for partial inclusion and approximate equality.

A method will now be described for deriving relationships between queries using fuzzy processing techniques for implementation by the apparatus 100 and in particular by the knowledge acquisition module 135 according to a preferred embodiment of the present invention. Preferably, the knowledge acquisition module 135 determines the degrees to which a query P generalises a query Q and to which the query P specialises the query Q for each pair of queries P and Q, in the context of the information system 125, using a representation framework known as the “mass assignment framework” in combination with a technique for calculating conditional probabilities of fuzzy sets called “semantic unification”. These techniques are taught for example in the following published documents: J. F. Baldwin (1992) in “The Management of Fuzzy and Probabilistic Uncertainties for Knowledge-based Systems.”, in the Encyclopedia of AI, edited by S. A. Shapiro, published by John Wiley (2nd edition), pages 528-537; J. F. Baldwin (1992) “Mass Assignments and Fuzzy Sets for Fuzzy Databases” in Advances in the Shafer Dempster Theory of Evidence, edited by M. Fedrizzi, J. Kacprzyk and R. R. Yager, published by John Wiley; J. F. Baldwin and T. P. Martin (2001) in “Towards Inductive Support Logic Programming”, Proc. IFSA-NAFIPS 2001, Vancouver, pages 1875-1880; and J. F. Baldwin, J. Lawry, and T. P. Martin in “Efficient Algorithms for Semantic Unification”, in Proc. Information Processing and the Management of Uncertainty, 1996, Spain.

Consider firstly the proposition that a query P generalises a query Q. This proposition is represented by the rule

    Relevant(P, E) ← Relevant(Q, E)

where E is an entry (set of information) identifiable in the information system 125. The degree to which this rule applies in respect of the queries P and Q may be calculated from the fuzzy conditional

    {x: Relevant(P,x)} | {x: Relevant(Q,x)}

where x is a set of information in the information system 125, the calculation being performed over mass assignment elements making up fuzzy answer relations SP and SQ. For example, suppose that execution of the query P by the information retrieval module 120 returns the fuzzy answer relation

    SP = {a1: 1, a2: 1, a3: 0.7, a4: 0.6}

and execution of the query Q returns

    SQ = {a1: 1, a2: 0.8, a3: 0.5}

In these fuzzy answer relations, a1, a2, etc. are answer identifiers, e.g. as used in the table above, and the values are fuzzy membership values for each answer calculated for example by the information retrieval module 120 by conventional means and representative of the degree to which the respective answer would be included in a response to the respective query by the information system 125. Each value is essentially a measure of the relevance of the answer to the query as may be determined by any one of a number of known information retrieval algorithms.

Intuitively, from an inspection of the fuzzy answer relations SP and SQ, the query P seems to be more general than the query Q, since Q returns fewer answers from the information system 125 and lower membership values in two cases (a2 and a3) than P. To calculate the degree of support for the proposition that the query Q is generalised by P, a mass assignment is firstly formed on each of the fuzzy answer relations, as follows:

    m(SP) = {<a1,a2>}: 0.3, {<a1,a2>, <a1,a2,a3>}: 0.1, {<a1,a2>, <a1,a2,a3>, <a1,a2,a3,a4>}: 0.6

    m(SQ) = {<a1>}: 0.2, {<a1>, <a1,a2>}: 0.3, {<a1>, <a1,a2>, <a1,a2,a3>}: 0.5

where the notation

    {<a1>, <a1,a2>}: 0.3

indicates a degree of support of 0.3, from an interval [0,1], for the set of relevant answers to be either a1 or both a1 and a2, the values, e.g. 0.3, being obtained by subtracting consecutive fuzzy membership values in the fuzzy relations SP and SQ. For example, in the mass assignment for SP, answer a1 cannot arise in isolation because the answer a2 also has a fuzzy membership value of 1 in the fuzzy relation SP, so the probability mass for {<a1>} is 0. However, the probability mass for {<a1,a2>} is 1-0.7=0.3, and that for {<a1,a2>, <a1,a2,a3>} is 0.7-0.6=0.1, etc.
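The subtraction of consecutive membership values can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the module's actual implementation: the function name `mass_assignment` is hypothetical, the fuzzy relation is assumed to be normalised (its largest membership is 1), and each mass is keyed by the largest answer combination of its element (for example `("a1", "a2")` for the element written above as {<a1>, <a1,a2>}).

```python
def mass_assignment(fuzzy_relation):
    """Return {largest answer combination: mass} for a fuzzy answer relation.

    Assumes a normalised relation (maximum membership of 1). Masses are the
    differences between consecutive membership values, as in the text.
    """
    # Rank answers by decreasing membership; ties keep their original order.
    ranked = sorted(fuzzy_relation.items(), key=lambda item: -item[1])
    masses = {}
    for i, (_, membership) in enumerate(ranked):
        next_membership = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        # Rounding only suppresses floating-point noise such as 0.30000000000000004.
        mass = round(membership - next_membership, 10)
        if mass > 0:
            combination = tuple(name for name, _ in ranked[: i + 1])
            masses[combination] = mass
    return masses


SP = {"a1": 1.0, "a2": 1.0, "a3": 0.7, "a4": 0.6}
SQ = {"a1": 1.0, "a2": 0.8, "a3": 0.5}

if __name__ == "__main__":
    print(mass_assignment(SP))
    print(mass_assignment(SQ))
```

For SP no mass is assigned to a1 alone, because a1 and a2 share the membership value 1, in agreement with the explanation above.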

The next step is to use the “standard point semantic unification” algorithm, described for example in the last of the four references listed above, to derive the degree of support for the rule

    Relevant(P, E) ← Relevant(Q, E)

from the mass assignments m(SP) and m(SQ).

For each of the answer combinations arising for the query Q, the question to be asked in the semantic unification process is: is it possible, and if so what is the probability, that given a particular answer combination for the query P, the answer combination for Q would arise? The answers to this question are presented for each of the queries in the table below, where the mass assignments for SQ are written along the top of the table and those for SP are written down the left hand side.

| m(SP) \ m(SQ) | {<a1>}: 0.2 | {<a1>, <a1,a2>}: 0.3 | {<a1>, <a1,a2>, <a1,a2,a3>}: 0.5 |
|---|---|---|---|
| {<a1,a2>}: 0.3 | 0 | ½ × 0.3 × 0.3 | ⅓ × 0.3 × 0.5 |
| {<a1,a2>, <a1,a2,a3>}: 0.1 | 0 | ½ × 0.1 × 0.3 | ⅔ × 0.1 × 0.5 |
| {<a1,a2>, <a1,a2,a3>, <a1,a2,a3,a4>}: 0.6 | 0 | ½ × 0.6 × 0.3 | ⅔ × 0.6 × 0.5 |

Taking the first column, first row, it can be seen intuitively that there is no possibility that, if the answer to a query was <a1> alone, the answer to the query can be a1 and a2 (<a1,a2>). However, in the first row, second column the question asked is whether the answer could be {<a1,a2>, <a1,a2,a3>} given that it was {<a1>, <a1,a2>}. The probability of this is the product of the individual probability masses multiplied by a factor indicative of the likelihood of the common answer combinations arising within the given answer combination. In the case of the first row, second column, assuming <a1> and <a1,a2> to be equally likely gives the factor ½ since if the answer is <a1,a2> then the answer could be {<a1,a2>, <a1,a2,a3>} whereas if the answer is <a1> then it cannot. Each cell is weighted by the corresponding likelihood factor and the product of the respective probability masses, and the overall degree of support (semantic unification value) for the rule

    Relevant(P, E) ← Relevant(Q, E)

is calculated as the sum over all cells in the table, giving a semantic unification value for this rule of 0.433.
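The arithmetic behind the value 0.433 can be reproduced by summing the cells of the table. The Python sketch below copies the likelihood factors and probability masses cell by cell from the table above and does not derive the factors itself, so it is a check of the worked example rather than an implementation of the semantic unification algorithm. The names `MASS_P`, `MASS_Q`, `FACTORS` and `support_from_table` are hypothetical.

```python
from fractions import Fraction as F

# Probability masses for SP (table rows) and SQ (table columns), as in the text.
MASS_P = [F(3, 10), F(1, 10), F(6, 10)]
MASS_Q = [F(2, 10), F(3, 10), F(5, 10)]

# Likelihood factors copied cell by cell from the table (rows: SP, columns: SQ).
FACTORS = [
    [F(0), F(1, 2), F(1, 3)],
    [F(0), F(1, 2), F(2, 3)],
    [F(0), F(1, 2), F(2, 3)],
]


def support_from_table(mass_p, mass_q, factors):
    """Sum factor x mass(P element) x mass(Q element) over every table cell."""
    return sum(
        factors[i][j] * mass_p[i] * mass_q[j]
        for i in range(len(mass_p))
        for j in range(len(mass_q))
    )


if __name__ == "__main__":
    support = support_from_table(MASS_P, MASS_Q, FACTORS)
    print(support, round(float(support), 3))  # 13/30 0.433
```

Exact fractions are used so that the total, 13/30, is free of rounding error before being shown to three decimal places.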

A similar exercise can be carried out to test the support for the rule

    Relevant(Q, E) ← Relevant(P, E)

which gives for this example a semantic unification value of 0.548.

The knowledge acquisition module 135 is arranged to perform the fuzzy analysis described above in respect of each combination of queries selected from the input query 110 and the corresponding queries generated by the query editor and generator 105, using the corresponding answer responses obtained from the information system 125 by the query execution and information retrieval module 120. The semantic unification values representing the degree of support for generalisation and for specialisation of one query by another are calculated and stored by way of an update to the concept dictionary 140, along with the respective queries themselves (if not already stored).

The process of updating the concept dictionary 140 starting from receipt of an input query 110 can be summarised and will now be described with reference to the flow diagram of FIG. 2.

Referring to FIG. 2, and additionally to FIG. 1, at STEP 200 an input query 110 is received by the query editor and generator 105 in the apparatus 100. At STEP 205, the query editor and generator 105 generates a set of queries related semantically to the input query 110 with reference to an external lexical reference source 115 such as Wordnet, referenced above. In particular, the external lexical reference source 115 is used to obtain, for a noun of the input query 110, at least one of three types of semantically related noun, as follows (these are WordNet options, for example):

- Synsets: roughly equivalent terms
- Hypernym: super types (less restricted terms)
- Hyponym: subtype (more restricted terms)

Each of the returned nouns is used to generate a related query by replacing the respective noun in a copy of the input query 110.
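The noun substitution of STEP 205 can be sketched in Python as follows. The small `LEXICON` dictionary is a hypothetical stand-in for the external lexical reference source 115; its entries are made up for illustration and are not actual WordNet output. The function name `related_queries` and the simple substring matching are likewise assumptions of this sketch.

```python
# Hypothetical stand-in for the lexical reference source 115. The entries are
# illustrative only and do not reproduce real WordNet data.
LEXICON = {
    "garage": {
        "synonym": ["service station"],
        "hypernym": ["building"],
        "hyponym": ["car repair"],
    },
}


def related_queries(query, lexicon=LEXICON):
    """Return (relation, derived query) pairs for each known noun in the query.

    Each derived query is a copy of the input query with one noun replaced.
    Nouns are located by naive substring matching, which is an assumption
    made to keep the sketch short.
    """
    derived = []
    for noun, relations in lexicon.items():
        if noun not in query:
            continue
        for relation, terms in relations.items():
            for term in terms:
                derived.append((relation, query.replace(noun, term)))
    return derived


if __name__ == "__main__":
    for relation, derived_query in related_queries("garage in Ipswich"):
        print(relation, "->", derived_query)
```

A real deployment would need proper noun identification rather than substring matching, and would draw its related terms from the chosen lexical database.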

At STEP 210, the input query 110 and the related queries generated at STEP 205 are executed by the information retrieval module 120 to identify sets of information stored in the information system 125 relevant to each of those queries. Preferably, in order to distinguish one set of information identified by the information retrieval module 120 from another, where distinct identifiers are not already defined and returned for each different set of information in the information system 125, either the information retrieval module 120 itself or the knowledge acquisition module 135 is arranged to compare retrieved sets of information and to assign a unique identifier to each distinct set of information so identified. If a particular related query returns no answers, it is assumed to be an inappropriate change to the input query 110 and is discarded.

Those sets of information identified as being relevant to the input query 110 in particular, or at least assigned identifiers and/or references to those sets of information, are output at STEP 215 as a set of output answers 130 in response to the input query 110. At STEP 220 the information retrieval results output from the module 120, following execution of the queries at STEP 210, are analysed by the knowledge acquisition module 135, along with the input query 110 and the related queries from STEP 205, to determine the degree of support for each of the different semantic relationships, i.e. generalisation, specialisation and hence similarity, between those queries, using one of the methods described above. The results of this analysis are used at STEP 225 to derive new knowledge about the information system 125, in particular to deduce the position of a newly input or derived query or its constituent terms in a semantic hierarchy of queries and/or query terms, and to analyse the queries themselves to deduce whether a particular term of one query is semantically equivalent or related by generalisation or specialisation to a term of another query. In order to deduce semantic equivalence, the knowledge acquisition module 135 is arranged to interpret the semantic unification values associated with generalisation and specialisation of one query by another: for example, if both values are “high”, the respective queries are interpreted to be semantically equivalent and a value representative of the degree of equivalence is taken to be the mean of the generalisation and specialisation values; if the value for generalisation is “low” and that for specialisation is “high”, or vice versa, then specialisation or generalisation, respectively, holds; and if both values are “low”, the semantic relationship between the queries is considered “weak”. A threshold value or a fuzzy set may be defined by the knowledge acquisition module 135 in respect of the information system 125, to control interpretation of “high” and “low”. For example, “low” may be a value “below 0.5”, or a fuzzy set may define “low” as {<0.3 is definitely low, >0.5 is definitely not low, 0.3-0.5 is fuzzy low}. However, the value ranges applicable in respect of a particular information system 125 may be adjusted by means of simple experiments.
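The interpretation of “high” and “low” values can be sketched in Python as below. Several details are assumptions rather than features stated in the text: the function names `interpret` and `fuzzy_low`, the treatment of a value exactly at the threshold as “high”, the choice to return no numeric value for a “weak” relationship, and the linear ramp between 0.3 and 0.5 in the fuzzy set (the text says only that this range is “fuzzy low”).

```python
def interpret(generalisation, specialisation, threshold=0.5):
    """Classify a pair of semantic unification values using a crisp threshold.

    Returns (relationship, value). "low" means below the threshold, following
    the "below 0.5" example; counting the threshold itself as "high" is an
    assumption.
    """
    high_gen = generalisation >= threshold
    high_spec = specialisation >= threshold
    if high_gen and high_spec:
        # Degree of equivalence is the mean of the two values, as in the text.
        return ("equivalent", (generalisation + specialisation) / 2)
    if high_gen:
        return ("generalisation", generalisation)
    if high_spec:
        return ("specialisation", specialisation)
    return ("weak", None)


def fuzzy_low(value, definitely_low=0.3, definitely_not_low=0.5):
    """Membership of a value in the fuzzy set "low".

    Below 0.3 is definitely low (1.0) and above 0.5 is definitely not low
    (0.0); the linear ramp in between is an assumed shape.
    """
    if value < definitely_low:
        return 1.0
    if value > definitely_not_low:
        return 0.0
    return (definitely_not_low - value) / (definitely_not_low - definitely_low)


if __name__ == "__main__":
    print(interpret(1.0, 0.5))
    print(interpret(0.75, 0.25))
    print(fuzzy_low(0.2), fuzzy_low(0.6))
```

The threshold and the two breakpoints are parameters so that they can be tuned per information system, in line with the remark that the applicable ranges may be adjusted by experiment.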

At STEP 230, the results of the analysis step 220 and the derive new knowledge step 225 are used to update a concept dictionary 140 generated and maintained by the apparatus 100 in respect of the information system 125.

Preferably, the concept dictionary 140 comprises data representative of a graph structure having nodes comprising query words or terms, e.g. “garage”, interlinked, where respective relationships have been derived, with the respective values indicating the degree of support calculated for the relationship (generalisation, specialisation or equivalence). The links represented in the concept dictionary 140 may be followed from one node to another to obtain a more generalised or more specialised word or phrase. Each link is a two-way link; following the link in one direction leads to a semantically more specialised node, in the other direction to a more general node, in the context of the respective information system 125. Preferably, a hash table is stored as part of the concept dictionary 140 to provide a link to a node of the graph structure from a given word or phrase, e.g. one entered by a user at a user interface. By way of example, a portion of a graph structure represented by data stored in the concept dictionary 140 will now be described with reference to FIG. 3.
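One possible in-memory layout for such a graph is sketched below in Python. The class and method names (`Node`, `ConceptDictionary`, `add_generalisation`, `generalise`, `specialise`) are hypothetical, and the support value used in the example is illustrative. A Python `dict` plays the role of the hash table from terms to nodes, and each relationship is stored on both nodes so that the link can be followed in either direction.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    term: str
    # Neighbouring terms mapped to the degree of support for the link.
    more_general: dict = field(default_factory=dict)
    more_specific: dict = field(default_factory=dict)


class ConceptDictionary:
    def __init__(self):
        self.nodes = {}  # hash table: word or phrase -> Node

    def _node(self, term):
        # Create the node on first use, otherwise return the existing one.
        if term not in self.nodes:
            self.nodes[term] = Node(term)
        return self.nodes[term]

    def add_generalisation(self, general, specific, support):
        """Record a two-way link stating that `general` generalises `specific`."""
        self._node(general).more_specific[specific] = support
        self._node(specific).more_general[general] = support

    def generalise(self, term):
        node = self.nodes.get(term)
        return sorted(node.more_general) if node else []

    def specialise(self, term):
        node = self.nodes.get(term)
        return sorted(node.more_specific) if node else []


if __name__ == "__main__":
    cd = ConceptDictionary()
    cd.add_generalisation("garage", "car repair", 0.9)  # illustrative value
    print(cd.specialise("garage"), cd.generalise("car repair"))
```

Storing each link on both endpoints costs a little extra space but makes both directions of traversal a single dictionary lookup.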

Referring to FIG. 3, a graph structure is shown comprising a number of query nodes 300-330 and links therebetween representative of derived semantic relationships. In particular, the query node 300 “garage in Ipswich” is shown linked to the query node 305 “buy car in Ipswich”. Stored semantic unification values, calculated in respect of a specialisation of the query node 300 by the query node 305, and vice versa, are also shown alongside the links 335 and 340 respectively. Also shown as part of each of the query nodes 305-330 are statements 345-370 derived during STEP 225 of the process described above with reference to FIG. 2. These statements define the strength of relationships between query terms, being measures (in the range [0,1]) of the degree of “similarity” between terms in the example of FIG. 3. For example, in query node 305, it has been calculated, using the semantic unification values derived in respect of the relationship between query node 305 and query node 300, that the term “buy car” is similar to the term “garage” with fuzzy membership value 0.273. That is, the terms have been found to be relatively dissimilar, as would be expected given the semantic unification value of 0.835 in support of specialisation by the query node 305 of the node 300 and only 0.112 in support of specialisation by the query node 300 of the node 305.

When presented in the form shown in FIG. 3, the contents of the conceptdictionary 140 can be seen to provide a useful source of information tousers wishing to make alterations to queries for use in interrogating arespective information system 125. In particular, having received theresults of a search of the information system 125 using a first querymade up of terms already known in the concept dictionary 140, it wouldbe clear from an inspection of links emanating from the nodecorresponding to the first query what alterations would need to be madeto either generalise or specialise the first query to, respectively,expand of reduce the scope of the returned query results with areasonable chance of success.

By way of example of the way in which a user may exploit the knowledge embodied in a concept dictionary 140 generated using preferred embodiments of the present invention, consider that the following knowledge has been accumulated in a concept dictionary 140, derived from previously used queries and query answers supplied by a respective information system 125, with each relationship having a high level of support (high semantic unification value):

-   Italian restaurant generalisation_of pizza
-   takeaway food generalisation_of pizza
-   takeaway food generalisation_of fish and chips
-   takeaway food generalisation_of Chinese takeaway

If a user finds that no answers are returned by the information system 125 in response to a query

-   “Find pizza in Ipswich”

then the knowledge (140) above may be used to suggest two possible query generalisations to improve the chances of obtaining useful answers, as follows:

-   “Find Italian restaurant in Ipswich”
-   “Find takeaway food in Ipswich”

If the user finds that the latter query was too general, i.e. it resulted in too many answers, then alternatives to this query may be offered, with reference to the knowledge above, by specialisation:

-   “Find fish and chips in Ipswich”
-   “Find Chinese takeaway in Ipswich”

In this way, not only has the user been able to make relevant adjustments to the choice of query in order to vary the responses given by the information system 125, but an alternative line of enquiry has also been suggested that may not have been apparent to the user of that particular information system 125.
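The pizza example above can be sketched in a few lines of code. This is an illustrative sketch only, not the implementation of the apparatus 100: the names `GENERALISATION_OF`, `generalise`, `specialise` and `rewrite` are hypothetical, the relationships are assumed to have already passed whatever support threshold is in use, and simple string replacement stands in for the query editing performed by the query generator and editor 105.

```python
# (general_term, specific_term) pairs, as listed in the example above.
# Each pair is assumed to carry a high semantic unification value.
GENERALISATION_OF = [
    ("Italian restaurant", "pizza"),
    ("takeaway food", "pizza"),
    ("takeaway food", "fish and chips"),
    ("takeaway food", "Chinese takeaway"),
]


def generalise(term):
    """Return the terms recorded as generalisations of `term`."""
    return [g for g, s in GENERALISATION_OF if s == term]


def specialise(term, exclude=()):
    """Return the terms recorded as specialisations of `term`,
    skipping any terms the user has already tried."""
    return [s for g, s in GENERALISATION_OF if g == term and s not in exclude]


def rewrite(query, old_term, new_terms):
    """Build one candidate query per replacement term."""
    return [query.replace(old_term, new) for new in new_terms]


# No answers for the original query: suggest broader queries.
broader = rewrite("Find pizza in Ipswich", "pizza", generalise("pizza"))

# "takeaway food" returned too many answers: suggest narrower queries,
# leaving out "pizza", which has already been tried.
narrower = rewrite(
    "Find takeaway food in Ipswich",
    "takeaway food",
    specialise("takeaway food", exclude=("pizza",)),
)
```

Here `broader` holds the two generalised queries and `narrower` the two specialised alternatives given in the example.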

Preferably, a user interface is provided with the apparatus 100 (not shown in FIG. 1) to enable a user to submit queries 110 to the apparatus 100 and to receive output answers 130 from the apparatus 100 in response. The user interface may also be arranged to enable a user to navigate knowledge stored in the concept dictionary 140, preferably with the aid of a graphical user interface showing derived relationships between query nodes and query terms in a manner similar to that shown in FIG. 3, in particular to enable the user to select particular queries and to request suggestions of more generalised, more specialised or semantically equivalent queries to execute in a respective information system 125.

Preferably, an apparatus 100 according to preferred embodiments of the present invention is implemented as a suite of computer programs using the Java programming language for running on a conventional server computer. The concept dictionary 140 is implemented using a conventional relational database management system such as Oracle™, although this too can be implemented using Java.
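As a rough illustration of how a concept dictionary 140 might be laid out in a relational database, the following sketch stores query nodes and directed specialisation links in two tables. Everything here is an assumption made for illustration: the description does not specify a schema, SQLite (via the Python standard library) is substituted for the Oracle™ system named above, and the table names, column names and the `min_support` threshold of 0.5 are hypothetical. The sample rows reuse the values from FIG. 3.

```python
import sqlite3

# In-memory database standing in for the relational store (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE query (
        id   INTEGER PRIMARY KEY,
        text TEXT NOT NULL UNIQUE
    );
    CREATE TABLE specialisation (
        specialising_id INTEGER NOT NULL REFERENCES query(id),
        specialised_id  INTEGER NOT NULL REFERENCES query(id),
        support         REAL NOT NULL CHECK (support BETWEEN 0 AND 1),
        PRIMARY KEY (specialising_id, specialised_id)
    );
""")

# Query nodes 300 and 305 and their links, using the FIG. 3 values.
conn.executemany(
    "INSERT INTO query (id, text) VALUES (?, ?)",
    [(300, "garage in Ipswich"), (305, "buy car in Ipswich")],
)
conn.executemany(
    "INSERT INTO specialisation VALUES (?, ?, ?)",
    [(305, 300, 0.835), (300, 305, 0.112)],
)


def generalisations(conn, query_text, min_support=0.5):
    """Return (text, support) for stored queries that the given query
    specialises with at least `min_support`, strongest first."""
    rows = conn.execute(
        """
        SELECT g.text, s.support
        FROM specialisation AS s
        JOIN query AS q ON q.id = s.specialising_id
        JOIN query AS g ON g.id = s.specialised_id
        WHERE q.text = ? AND s.support >= ?
        ORDER BY s.support DESC
        """,
        (query_text, min_support),
    )
    return rows.fetchall()
```

With this layout, looking up more general queries for “buy car in Ipswich” returns “garage in Ipswich”, whereas the reverse lookup returns nothing because the support of 0.112 falls below the assumed threshold.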

Besides use as an information retrieval method and apparatus, preferred embodiments of the present invention may be used to test the effectiveness of existing information retrieval systems. For example, the apparatus 100 may be linked to an existing information retrieval system so that the query generator and editor 105 is arranged to receive (in a monitoring role) queries entered by a user of the existing system and the query execution and information retrieval module 120 is arranged with access to submit queries to the existing system and to receive corresponding answers. Over a period of time in use, a concept dictionary 140 generated in respect of the existing system, by the process described above with reference to FIG. 2, may be exported in a format useable in the existing system and used to test the effectiveness of a query interface provided by the existing system, for example by comparing the results of executing queries suggested by the existing system with the results of executing queries suggested with reference to the generated concept dictionary 140.

In another mode of operation of preferred embodiments of the present invention, a bulk querying process may be implemented whereby a set of queries is built up and then sent into the apparatus 100 as input queries 110. This mode of operation may be particularly useful when a concept dictionary 140 needs to be generated quickly rather than over an extended period of use of the apparatus 100 with a particular information system 125.

In another mode of operation of preferred embodiments of the present invention, a concept dictionary (140) generated in respect of a particular information system 125 may be exported in a format useable in another information retrieval system, also arranged with access to the information system 125, as a source of knowledge for use in querying the information system 125 through the other information retrieval system.

1. A method of generating a concept dictionary (140) for use in querying an information system (125), comprising the steps of: (i) receiving an information search criterion; (ii) deriving (105), using a lexical reference source (115), at least one search criterion having related meaning to said received search criterion (110); (iii) identifying sets of information in said information system (125) relevant to said received search criterion (110) and to said at least one derived search criterion; (iv) analysing the identified sets of information to derive relationships between said received search criterion (110) and said at least one derived search criterion in the context of said information system (125); and (v) storing, in a concept dictionary (140), information relating to said received (110) and said at least one derived search criterion and to respective said derived relationships therebetween, for use in querying said information system (125).
2. A method as in claim 1, wherein, at step (i), receiving an information search criterion (110) comprises selecting an information search criterion stored in said concept dictionary (140).
3. A method as in claim 1, wherein, at step (ii), deriving at least one search criterion having related meaning comprises replacing a term of said received search criterion (110) with a related term having a more specific meaning according to said lexical reference source (115).
4. A method as in claim 1, wherein, at step (ii) deriving at least one search criterion having related meaning comprises replacing a term of said received search criterion (110) with a related term having a more general meaning according to said lexical reference source (115).
5. A method as in claim 1, wherein, at step (ii) deriving at least one search criterion having related meaning comprises replacing a term of said received search criterion (110) with a related term having an equivalent meaning according to said lexical reference source (115).
6. A method as in claim 1, wherein, at step (ii), said lexical reference source (115) is a thesaurus.
7. A method as in claim 1, wherein, at step (ii), said lexical reference source (115) is an ontological database.
8. A method as in claim 1, wherein, at step (ii), a plurality of search criteria are derived, each having related meaning to said received search criterion (110), and wherein at step (iv), the respective identified sets of information are analysed to derive relationships between search criteria comprised in said plurality of derived search criteria.
9. A method as in claim 1, wherein, at step (iv), deriving relationships between said search criteria comprises performing fuzzy processing of said derived search criteria and respective said identified sets of information to determine a measure of the generalisation and/or specialisation of one said search criterion over another in the context of said information system (125).
10. A method of accessing sets of information stored in an information system (125) using information search criteria stored in a concept dictionary (140) generated for the information system (125) according to the method in claim 1, comprising the steps of: (a) selecting a first information search criterion; (b) using a search engine to identify one or more sets of information in the information system (125) relevant to said first information search criterion; and (c) selecting at least one further information search criterion from search criteria stored in the concept dictionary (140), semantically related to said first information search criterion according to information stored in the concept dictionary (140), according to whether a more general, a more specialised or an equivalent search is required.
11. An information retrieval apparatus (100) for accessing sets of information stored in an information system (125), comprising: an input for receiving an information search criterion (110); deriving means (105) for deriving, using a lexical reference source (115), at least one search criterion having related meaning to said received information search criterion (110); retrieval means (120) for identifying sets of information in said information system (125) relevant to said received search criterion (110) and to said at least one derived search criterion; analysis means (135) for analysing said identified sets of information to derive relationships between said received search criterion (110) and said at least one derived search criterion in the context of said information system (125); and updating means for storing, in a concept dictionary (140), information relating to said received (110) and said at least one derived search criterion and to respective said derived relationships therebetween, for use in querying said information system (125).